Abstract

We  revisit  the  classical  problem  of  scalar  replacement  of  array  elements  and  pointer  accesses.  We  generalize 
the state-of-the-art algorithm, by Carr and Kennedy [CK94], to handle a combination of both conditional
control-flow  and  inter-iteration  data  reuse.  The  basis  of  our  algorithm  is  to  make  the  dataflow  availability 
information  precise  using  a  technique  we  call  SIDE:  Statically  Instantiate  and  Dynamically  Evaluate.  In 
SIDE  the  compiler  inserts  explicit  code  to  evaluate  the  dataflow  information  at  runtime. 

Our algorithm operates under the same assumptions as the classical one (perfect dependence information),
and  has  the  same  limitations  (increased  register  pressure).  It  is,  however,  optimal  in  the  sense  that  within 
each  code  region  where  scalar  promotion  is  applied,  given  sufficient  registers,  each  memory  location  is  read 
and  written  at  most  once. 
1 Introduction


The goal of scalar replacement (also called register promotion) is to identify repeated accesses made to the
same memory address, either within the same iteration or across iterations, and to remove the redundant
accesses (here we only study promotion within the innermost loop bodies, but the ideas we present are
applicable to wider code regions as well). The state-of-the-art algorithm for scalar replacement was proposed
in 1994 by Steve Carr and Ken Kennedy [CK94].¹ This algorithm handles two special instances of
the scalar replacement problem very well: (1) repeated accesses made within the same loop iteration in code
having arbitrary conditional control-flow; and (2) code with repeated accesses made across iterations in the
absence of conditional control-flow. For (1) the algorithm relies on partial redundancy elimination (PRE),
while for (2) it relies on dependence analysis and rotating scalar values. However, that algorithm cannot
handle arbitrary combinations of both conditional control-flow and inter-iteration reuse of data.

Here  we  present  a  very  simple  algorithm  which  generalizes  and  simplifies  the  Carr-Kennedy  algorithm 
in  an  optimal  way.  The  optimality  criterion  that  we  use  throughout  this  paper  is  the  number  of  dynamically 
executed memory accesses. After application of our algorithm on a code region, no memory location is
read more than once or written more than once in that region. Also, after promotion, no memory location
is  read  or  written  if  it  was  not  so  in  the  original  program  (i.e.,  our  algorithm  does  not  perform  speculative 
promotion).  Our  algorithm  operates  under  the  same  assumptions  as  the  Carr-Kennedy  algorithm.  That  is, 
it  requires  perfect  dependence  information  to  be  applicable.  It  is  therefore  mostly  suitable  for  FORTRAN 
benchmarks.  We  have  implemented  our  algorithm  in  a  C  compiler,  and  we  have  found  numerous  instances 
where  it  is  applicable  as  well. 

For  the  impatient  reader,  the  key  idea  is  the  following:  for  each  value  to  be  scalarized,  the  compiler 
creates  a  1-bit  runtime  flag  variable  indicating  whether  the  scalar  value  is  “valid.”  The  compiler  also  creates 
code  which  dynamically  updates  the  flag.  The  flag  is  then  used  to  detect  and  avoid  redundant  loads  and  to 
indicate  whether  a  store  has  to  occur  to  update  a  modified  value  at  loop  completion.  This  algorithm  ensures 
that  only  the  first  load  of  a  memory  location  is  executed  and  only  the  last  store  takes  place.  This  algorithm 
is  a  particular  instance  of  a  new  general  class  of  algorithms:  it  transforms  values  customarily  used  only  at 
compile-time  for  dataflow  analysis  into  dynamic  objects.  Our  algorithm  instantiates  availability  dataflow 
information  into  run-time  objects,  therefore  achieving  dynamic  optimality  even  in  the  presence  of  constructs 
which  cannot  be  statically  optimized. 

We introduce the algorithm by a series of examples which show how it is applied to increasingly
complicated code structures. We start in Section 2 by showing how the algorithm handles a special case, that
of memory operations from loop-invariant addresses. In Section 3.3 we show how the algorithm optimizes
loads whose addresses are induction variables. We then show how stores can be treated optimally in
Section 3.4. In Section 4 we describe two implementations of our algorithm: one based on control-flow
graphs (CFGs), and one relying on a special form of Static Single Assignment (SSA) named Pegasus.
Although the CFG variant is simpler to implement, Pegasus simplifies the dependence analysis required to
determine whether promotion is applicable. Special handling of loop-invariant guarding predicates is
discussed in Section 5. Finally, we quantify the impact of an implementation of this algorithm
when applied to the innermost loops of a series of C programs.

This  paper  makes  the  following  new  research  contributions: 


•  it introduces the SIDE class of dataflow analyses, in which the analysis is carried out statically, but the
computation of the dataflow information is performed dynamically, creating dynamically optimal
code for constructs which cannot be statically made optimal;

¹In this paper we do not consider speculative promotion, which has been extensively studied since then.


while (1) {
    if (cond1) statement1;
    if (cond2) statement2;
    if (cond_break1) break;
}

Figure 1: For ease of presentation we assume that prior to register promotion, all loop bodies are predicated.


•  it  introduces  a  new  register-promotion  algorithm  as  a  SIDE  dataflow  analysis; 

•  it introduces a linear-time² term-rewriting algorithm for performing inter-iteration register promotion
in the presence of control-flow;

•  it  describes  register  promotion  as  implemented  in  Pegasus,  showing  how  it  takes  advantage  of  the 
memory  dependence  representation  for  effective  dependence  analysis. 

1.1  Conventions 

We present all the optimization examples as source-to-source transformations of schematic C program
fragments. For simplicity of exposition we assume that we are optimizing the body of an innermost
loop. We also assume that none of the scalar variables in our examples have their address taken. We write
f(i) to denote an arbitrary expression involving i which has no side effects (but not a function call). We
write for (i) to denote a loop having i as a basic induction variable; we assume that the loop body is
executed at least once.

For pedagogical purposes, the examples we present all assume that the code has been brought into a
canonical form through the use of if-conversion [AKPW83], such that each memory statement is guarded
by a predicate; i.e., the code has the shape in Figure 1. Our algorithms are easily generalized to handle
nested natural loops and arbitrary forward control-flow within the loop body.

2  Scalar  Replacement  of  Loop-Invariant  Memory  Operations 

In this section we describe a new register promotion algorithm which can eliminate memory references
made to loop-invariant addresses in the presence of control flow. This algorithm is further expanded in
Section 3.3 and Section 3.4 to promote memory accesses into scalars when the memory references have a
constant stride.

2.1  The  Classical  Algorithm 

Figure 2 shows a simple example and how it is transformed by the classical scalar promotion algorithm.
Assuming p cannot point to i, the key fact is that *p always loads from and stores to the same address; therefore
*p  can  be  transformed  into  a  scalar  value.  The  load  is  lifted  to  the  loop  pre-header,  while  the  store  is  moved 

²This time does not include the time to compute the dependences, only the actual register promotion transformation.
for (i)
    *p += i;

tmp = *p;
for (i)
    tmp += i;
*p = tmp;

Figure  2:  A  simple  program  before  and  after  register  promotion  of  loop-invariant  memory  operations. 


for (i)
    if (i & 1)
        *p += i;


Figure  3:  A  small  program  that  is  not  amenable  to  classical  register  promotion. 


after the loop. (The latter is slightly more difficult to accomplish if the loop has multiple exits going to
multiple destinations. Our implementation handles these as well, as described in Section 4.2.2.)

2.2  Loop-Invariant  Addresses  and  Control-Flow 

However, the simple algorithm is no longer applicable to the slightly different program in Figure 3. Lifting
the load or store out of the loop may be unsafe with respect to exceptions: one cannot lift a memory operation
out of a loop if it may never be executed within the loop.

To optimize Figure 3 it is enough to maintain a valid bit in addition to the tmp scalar. The valid
bit indicates whether tmp indeed holds the value of *p, as in Figure 4. The valid bit is initialized to false.
A load from *p is performed only if the valid bit is false. Either loading from or storing to *p sets the
valid bit to true. This program will forward the value of *p through the scalar tmp between iterations
arbitrarily far apart.

The insight is that it may be profitable to compute dataflow information at runtime. For example, the
valid flag within an iteration is nothing more than the dynamic equivalent of the availability dataflow
information for the loaded value, which is the basis of classical Partial Redundancy Elimination (PRE) [MR79].
When PRE can be applied statically, it is certainly better to do so. The problem with Figure 3 is that the
compiler cannot statically summarize when the condition (i & 1) is true, and therefore has to act
conservatively, assuming that the loaded value is never available. Computing the availability information at
run-time eliminates this conservative approximation. Maintaining and using runtime dataflow information makes
sense when we can eliminate costly operations (e.g., memory accesses) by using inexpensive operations
(e.g., Boolean register operations).

This  algorithm  generates  a  program  which  is  optimal  with  respect  to  the  number  of  loads  within  each 
region  of  code  to  which  promotion  is  applied  (if  the  original  program  loads  from  an  address,  then  the 
/* prelude */
tmp = uninitialized;
tmp_valid = false;

for (i) {
    /* load from *p becomes: */
    if ((i & 1) && !tmp_valid) {
        tmp = *p;
        tmp_valid = true;
    }

    /* store to *p becomes */
    if (i & 1) {
        tmp += i;
        tmp_valid = true;
    }
}

/* postlude */
if (tmp_valid)
    *p = tmp;

Figure 4: Optimization of the program in Figure 3.


optimized program will load from that address exactly once), but may execute one extra store:³ if the
original program loads the value but never stores to it, the valid bit will be true, enabling the postlude
store. In order to treat this case as well, a dirty flag, set on writes, has to be maintained, as shown in
Figure 5.⁴

Note:  in  order  to  simplify  the  presentation,  the  examples  in  the  rest  of  the  paper  will  not  include  the 
dirty  bit.  However,  its  presence  is  required  for  achieving  an  optimal  number  of  stores. 
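
As a runnable illustration of the valid and dirty flags, the following C sketch instruments the loop of
Figure 3 with access counters. It is not taken from our implementation: the struct access_count, the
function names, the explicit loop bound n, and the use of 0 as a stand-in for "uninitialized" are all
assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instrumentation: counts dynamic accesses made to *p. */
typedef struct { int loads, stores; } access_count;

/* Original loop of Figure 3: every odd iteration loads and stores *p. */
static access_count inv_original(int *p, int n) {
    access_count c = {0, 0};
    assert(n >= 1);                  /* the paper assumes at least one iteration */
    for (int i = 0; i < n; i++) {
        if (i & 1) {
            int v = *p;  c.loads++;
            *p = v + i;  c.stores++;
        }
    }
    return c;
}

/* Promoted loop: the valid bit guards the load, the dirty bit guards the
 * postlude store, so *p is read at most once and written at most once. */
static access_count inv_promoted(int *p, int n) {
    access_count c = {0, 0};
    int tmp = 0;                     /* 0 stands in for "uninitialized" */
    bool tmp_valid = false;
    bool tmp_dirty = false;
    assert(n >= 1);
    for (int i = 0; i < n; i++) {
        if ((i & 1) && !tmp_valid) { /* load from *p */
            tmp = *p;  c.loads++;
            tmp_valid = true;
        }
        if (i & 1) {                 /* store to *p */
            tmp += i;
            tmp_valid = true;
            tmp_dirty = true;
        }
    }
    if (tmp_dirty) { *p = tmp; c.stores++; }   /* postlude */
    return c;
}
```

With n = 8 the original loop executes four loads and four stores, while the promoted loop executes one of
each; with n = 1 the guard is never true and the promoted loop touches memory not at all.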


3  Inter-Iteration  Scalar  Promotion 

Here  we  extend  the  algorithm  for  promoting  loop-invariant  operations  to  perform  scalar  promotion  of 
pointer  and  array  variables  with  constant  stride.  We  assume  that  the  code  has  been  subjected  to  standard 
dependence  analysis  prior  to  scalar  promotion. 

3.1  The  Carr- Kennedy  Algorithm 

Figure 6 illustrates the classical Carr-Kennedy inter-iteration register promotion algorithm from [CCK90],
which is only applicable in the absence of control-flow. In general, reusing a value after k iterations
requires the creation of k distinct scalar values, to hold the simultaneously live values of a[i] loaded for k

³However, this particular program is optimal for stores as well.

⁴The dirty bit may also be required for correctness, if the value is read-only and the writes within the loop are always
dynamically predicated “false.”
/* prelude */
tmp = uninitialized;
tmp_valid = false;
tmp_dirty = false;

for (i) {
    /* load from *p becomes: */
    if ((i & 1) && !tmp_valid) {
        tmp = *p;
        tmp_valid = true;
    }

    /* store to *p becomes */
    if (i & 1) {
        tmp += i;
        tmp_valid = true;
        tmp_dirty = true;
    }
}

/* postlude */
if (tmp_dirty)
    *p = tmp;

Figure 5: Optimization of store handling from Figure 4.


consecutive values of i. This quickly creates register pressure and therefore heuristics are usually used to
decide whether promotion is beneficial. Since register pressure has been very well addressed in the literature
[CCK90, Muc97, CMS96, CW95], we will not concern ourselves with it anymore in this text.
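
To make the role of the k scalar values concrete, the following C sketch generalizes the rotation of Figure 6
to a reuse distance k. It is only an illustration under stated assumptions: the function names, the MAX_DIST
bound, and the small array regs (which a compiler would instead unroll into k distinct scalars) are
hypothetical and not part of the Carr-Kennedy algorithm as published.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIST 8   /* hypothetical cap on the reuse distance */

/* Reference loop: a[i] = a[i] + a[i-k], two loads per iteration. */
static void dist_k_reference(int *a, size_t n, size_t k) {
    for (size_t i = k; i < n; i++)
        a[i] = a[i] + a[i - k];
}

/* Rotating version: regs[j] holds a[i-k+j] at the top of iteration i.
 * Returns the number of loads executed. */
static size_t dist_k_rotating(int *a, size_t n, size_t k) {
    int regs[MAX_DIST];
    size_t loads = 0;
    assert(k >= 1 && k <= MAX_DIST && n >= k);
    for (size_t j = 0; j < k; j++) {       /* pre-header loads */
        regs[j] = a[j];
        loads++;
    }
    for (size_t i = k; i < n; i++) {
        int v = a[i] + regs[0];            /* only a[i] is loaded */
        loads++;
        a[i] = v;
        for (size_t j = 0; j + 1 < k; j++) /* rotate scalar values */
            regs[j] = regs[j + 1];
        regs[k - 1] = v;
    }
    return loads;
}
```

Each array element is loaded exactly once, so the rotating version executes n loads in total, at the price of
keeping k values live across iterations.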

A later extension to the Carr-Kennedy algorithm [CK94] allows it to also handle control flow. The
algorithm  optimally  handles  reuse  of  values  within  the  same  iteration,  by  using  PRE  on  the  loop  body. 
However,  this  algorithm  can  no  longer  promote  values  across  iterations  in  the  presence  of  control-flow. 
The  compiler  has  difficulty  in  reasoning  about  the  intervening  updates  between  accesses  made  in  different 
iterations  in  the  presence  of  control-flow. 


3.2  Partial  Redundancy  Elimination 

Before presenting our solution let us note that even the classical PRE algorithm (without the support of
special register promotion) is quite successful in optimizing loads made in consecutive iterations. Figure 7
shows a sample loop and its optimization by gcc, which does not have a register promotion algorithm at
all. By using PRE alone gcc manages to reuse the load from ptr2 one iteration later.

The PRE algorithm is unable to achieve the same effect if data is reused in any iteration other than the
immediately following iteration or if there are intervening stores. In such cases an algorithm like
Carr-Kennedy is necessary to remove the redundant accesses. Note that the use of valid flags achieves
the same degree of optimality as PRE within an iteration, but at the expense of maintaining run-time
information.
for (i = 2; i < N; i++)
    a[i] = a[i] + a[i-2];

/* pre-header */
a0 = a[0];    /* invariant: a0 = a[i-2] */
a1 = a[1];    /* a1 = a[i-1] */
for (i = 2; i < N; i++) {
    a2 = a[i];    /* a2 = a[i] */
    a2 = a0 + a2;
    a[i] = a2;

    /* Rotate scalar values */
    a0 = a1;
    a1 = a2;
}

Figure 6: Program with no control-flow before and after register promotion performed by the Carr-Kennedy
algorithm.


do {
    *ptr1++ = *ptr2++;
} while (--cnt && *ptr2);

tmp = *ptr2;
do {
    *ptr1++ = tmp;
    ptr2++;
    if (!--cnt)
        break;
    tmp = *ptr2;
    if (!tmp)
        break;
} while (1);

Figure  7:  Sample  loop  and  its  optimization  using  PRE.  (The  output  is  the  equivalent  of  the  assembly  code 
generated  by  gcc.)  PRE  can  achieve  some  degree  of  register  promotion  for  loads. 
for (i = 2; i < N; i++)
    if (f(i)) a[i] = a[i] + a[i-2];


Figure  8:  Sample  program  which  cannot  be  handled  optimally  by  either  PRE  or  the  classical  Carr-Kennedy 
algorithm. 

3.3  Removing  All  Redundant  Loads 

However, the classical algorithm is unable to promote all memory references guarded by a conditional, as
in Figure 8. It is, in general, impossible for a compiler to check when f(i) is true in both iteration i and
in iteration i-2, and therefore it cannot deduce whether the load from a[i] can be reused as a[i-2] two
iterations later.

Register promotion has the goal of only executing the first load and the last store of a variable. The
algorithm in Section 2 for handling loop-invariant data is immediately applicable for promoting loads across
iterations, since it performs a load as soon as possible. By maintaining availability information at runtime,
using valid flags, our algorithm can transform the code to perform a minimal number of loads as in
Figure 9. Applying constant propagation and dead-code elimination will simplify this code by removing
the unnecessary references to a2_valid.
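
The following C sketch shows one way the valid-flag scheme can be applied to the loop of Figure 8. It is a
reconstruction for illustration rather than a copy of Figure 9: the guard fig8_guard is a hypothetical
stand-in for f(i), the load counters are instrumentation, and 0 stands in for "uninitialized".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the side-effect-free guard f(i). */
static bool fig8_guard(size_t i) { return i % 3 != 0; }

/* Original loop of Figure 8; returns the number of loads executed. */
static size_t fig8_reference(int *a, size_t n) {
    size_t loads = 0;
    for (size_t i = 2; i < n; i++) {
        if (fig8_guard(i)) {
            a[i] = a[i] + a[i - 2];
            loads += 2;
        }
    }
    return loads;
}

/* Valid-flag version: a0, a1, a2 shadow a[i-2], a[i-1], a[i]. */
static size_t fig8_promoted(int *a, size_t n) {
    size_t loads = 0;
    int a0 = 0, a1 = 0, a2 = 0;   /* 0 stands in for "uninitialized" */
    bool a0_valid = false, a1_valid = false, a2_valid = false;
    for (size_t i = 2; i < n; i++) {
        if (fig8_guard(i)) {
            /* load a[i]; a2_valid is always false here, so constant
             * propagation could remove this test */
            if (!a2_valid) { a2 = a[i]; a2_valid = true; loads++; }
            /* load a[i-2] only if it was not forwarded from iteration i-2 */
            if (!a0_valid) { a0 = a[i - 2]; a0_valid = true; loads++; }
            a2 = a2 + a0;
            a[i] = a2;            /* the store is kept; stores are treated in Section 3.4 */
            a2_valid = true;
        }
        /* Rotate scalars and valid flags. */
        a0 = a1;  a0_valid = a1_valid;
        a1 = a2;  a1_valid = a2_valid;
        a2_valid = false;
    }
    return loads;
}
```

For an eight-element array and this particular guard, the reference loop executes eight loads and the promoted
loop six: the loads of a[2] and a[5] made through a[i-2] are satisfied from scalars.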

3.4  Removing  All  Redundant  Stores 

Handling stores seems to be more difficult, since one should forgo a store if the value will be overwritten
in a subsequent iteration. However, in the presence of control-flow it is not obvious how to deduce whether
the overwriting stores in future iterations will take place. Here we extend the register promotion algorithm
to ensure that only one store is executed to each memory location, by showing how to optimize the example
in Figure 10.

We want to avoid storing to a[i+2], since that store will be overwritten two iterations later by the store
to a[i]. However, this is not true for the last two iterations of the loop. Since, in general, the compiler
cannot generate code to test loop-termination several iterations ahead, it looks as if both stores must be
performed in each iteration. However, we can do better than that by performing within the loop only the
store to a[i], which certainly will not be overwritten. The loop in Figure 11 does exactly that. The loop
body never overwrites a stored value but may fail to correctly update the last two elements of array a.
Fortuitously, after the loop completes, the scalars a0, a1 hold exactly these two values. So we can insert a
loop postlude to fix the potentially missing writes. (Of course, dirty bits should be used to prevent useless
updates.)
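
The store transformation can be checked with a small C harness. The reference loop below is an assumed shape
for the program of Figure 10, inferred from the code of Figure 11; the guard fig10_guard, the function names,
and the store counters are likewise hypothetical. In this sketch the valid flags double as dirty bits, because
a1 and a2 only ever become valid through a store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the guard f(i). */
static bool fig10_guard(size_t i) { return i % 3 != 0; }

/* Assumed shape of the Figure 10 loop; a must hold n + 2 elements.
 * Returns the number of stores executed. */
static size_t fig10_reference(int *a, size_t n) {
    size_t stores = 0;
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] + 1;  stores++;
        if (fig10_guard(i)) { a[i + 2] = a[i]; stores++; }
    }
    return stores;
}

/* Promoted loop following Figure 11: only the generating store to a[i]
 * stays in the loop; pending values are written back in the postlude. */
static size_t fig11_promoted(int *a, size_t n) {
    size_t stores = 0, i;
    int a0 = a[0], a1 = 0, a2 = 0;
    bool a0_valid = true, a1_valid = false, a2_valid = false;
    assert(n >= 1);                   /* loop body executes at least once */
    for (i = 0; i < n; i++) {
        bool fi = fig10_guard(i);
        if (!a0_valid)                /* load a[i] */
            a0 = a[i];
        a0 = a0 + 1;                  /* store a[i] */
        a[i] = a0;  stores++;
        if (fi) {                     /* store a[i+2] is deferred */
            a2 = a0;
            a2_valid = true;
        }
        a0 = a1;  a1 = a2;            /* rotate scalars and valid flags */
        a0_valid = a1_valid;
        a1_valid = a2_valid;
        a2_valid = false;
    }
    if (a0_valid) { a[i] = a0;     stores++; }   /* postlude */
    if (a1_valid) { a[i + 1] = a1; stores++; }
    return stores;
}
```

For n = 6 and this guard, the reference loop executes ten stores and the promoted version eight: six
generating stores plus two postlude stores for the last two elements.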

4  Implementation 

This  algorithm  is  probably  much  easier  to  illustrate  than  to  describe  precisely.  Since  the  important  message 
was  hopefully  conveyed  by  the  examples,  we  will  just  briefly  sketch  the  implementation  in  a  CFG-based 
framework  and  describe  in  somewhat  more  detail  the  Pegasus  implementation. 
Figure 9: Optimal version of the code in Figure 8.

4.1  CFG-Based  Implementation 

In general, for each constant reference to a[i+j] (for a compile-time constant j) we maintain a scalar tj
and a valid bit tjvalid. Then scalar replacement just makes the following changes:

•  Replace every load from a[i+j] with a pair of statements:
tj = tjvalid ? tj : a[i+j]; tjvalid = true.

•  Replace every store a[i+j] = e with a pair of statements:
tj = e; tjvalid = true.
Figure 10: Program with control-flow and redundant stores and its optimization for an optimal number of
loads. The store to a[i+2] may be overwritten two iterations later by the store to a[i].
a0_valid = true;    /* a[i] */
a0 = a[0];
a1_valid = false;
a2_valid = false;
for (i) {
    fi = f(i);

    /* load a[i] */
    if (!a0_valid)
        a0 = a[i];

    /* store a[i] */
    a0 = a0 + 1;
    a[i] = a0;

    /* store a[i+2] */
    if (fi) {
        a2 = a0;
        /* No update of a[i+2]: may be overwritten */
        a2_valid = true;
    }

    /* Rotate scalars and valid flags */
    a0 = a1;
    a1 = a2;
    a0_valid = a1_valid;
    a1_valid = a2_valid;
    a2_valid = false;
}

/* Postlude */
if (a0_valid)
    a[i] = a0;
if (a1_valid)
    a[i+1] = a1;

Figure 11: Optimal version of the example in Figure 10.


Furthermore, all stores except the generating store⁵ are removed. Instead, compensation code is added
“after” the loop: for each tj append a statement if (tjvalid) a[i+j] = tj.

Complexity: the algorithm, aside from the dependence analysis, is linear in the size of the loop.⁶
Correctness and optimality: both follow from this invariant: the tjvalid flag is true if and
only if tj represents the contents of the memory location it scalarizes.
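
The rewriting rules above can be sketched as a handful of C helpers. The type slot and the slot_* function
names are hypothetical; a compiler would emit the corresponding statements inline instead of calling
functions, and would add a dirty bit to each slot to avoid useless compensation stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One promoted reference: the scalar tj and its valid bit tjvalid. */
typedef struct { int value; bool valid; } slot;

/* Load rule: tj = tjvalid ? tj : a[i+j]; tjvalid = true. */
static int slot_load(slot *t, const int *addr) {
    t->value = t->valid ? t->value : *addr;
    t->valid = true;
    return t->value;
}

/* Store rule: tj = e; tjvalid = true. */
static void slot_store(slot *t, int e) {
    t->value = e;
    t->valid = true;
}

/* End of iteration: shift each slot one position down, so that the
 * value of a[i+j+1] becomes the value of a[i+j] in the next iteration. */
static void slot_rotate(slot *t, size_t count) {
    assert(count >= 1);
    for (size_t j = 0; j + 1 < count; j++)
        t[j] = t[j + 1];
    t[count - 1].valid = false;
}

/* Compensation code after the loop: if (tjvalid) a[i+j] = tj. */
static void slot_flush(const slot *t, int *addr) {
    if (t->valid)
        *addr = t->value;
}
```

Every helper preserves the invariant stated above: a slot is marked valid only when its value equals the
current (or pending) contents of the location it scalarizes.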

⁵According to the terminology in [CCK90], a generating store is the one writing to a[i+j] for the smallest j promoted.

⁶We assume that a constant number of scalar values are introduced.
Compiler pass                            LOC
Dependence analysis                      162
Loop-invariant load/store promotion      100
Register promotion of array elements     205
Induction variable analysis              447
Memory disambiguation                    646

Table 1: Size of C++ code implementing analyses and optimizations related to register promotion.


4.2  An  SSA-based  algorithm 

We have implemented the above algorithms in the C compiler named CASH. CASH relies on
Pegasus [BG02b, BG02a, BG03], a dataflow intermediate representation. In this section we briefly describe the
main features of Pegasus and then show how it enables a very efficient implementation of register
promotion.

As we argued in [BG03], Pegasus enables extremely compact implementations of many important
optimizations; register promotion corroborates this statement. In Table 1 we show the implementation code
size of all the analyses and transformations used by CASH for register promotion.

4.2.1  Pegasus 

Pegasus represents the program as a directed graph where nodes are operations and edges indicate value
flow. Pegasus leverages techniques used in compilers for predicated execution machines [MLC+92] by
collecting multiple basic blocks into one hyperblock; each hyperblock is transformed into straight-line
code through the use of the predicated static single-assignment (PSSA) form [CSC+00]. Instead of SSA
φ nodes, within hyperblocks Pegasus uses explicit multiplexor (mux) nodes; the mux data inputs are the
reaching definitions. The mux predicates correspond to the path predicates in PSSA.

Hyperblocks  are  stitched  together  into  a  dataflow  graph  representing  the  entire  procedure  by  creating 
dataflow  edges  connecting  each  hyperblock  to  its  successors.  Each  variable  live  at  the  end  of  a  hyperblock 
gives rise to an eta node [OBM90]. Eta nodes have two inputs (a value and a predicate) and one output.
When  the  predicate  evaluates  to  “true,”  the  input  value  is  moved  to  the  output;  when  the  predicate  evaluates 
to  “false,”  the  input  value  and  the  predicate  are  simply  consumed,  generating  no  output.  A  hyperblock  with 
multiple  predecessors  receives  control  from  one  of  several  different  points;  such  join  points  are  represented 
by  merge  nodes. 
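
The behavior of eta and mux nodes can be modeled in a few lines of C. This is only a sketch for intuition:
the type maybe_int, which marks whether a node produced an output, and both function names are hypothetical,
and the sketch assumes that at most one mux predicate is true at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dataflow output that may be absent. */
typedef struct { bool present; int value; } maybe_int;

/* Eta node: forwards the value when the predicate is true; otherwise
 * the inputs are consumed and no output is generated. */
static maybe_int eta(int value, bool pred) {
    maybe_int out = { pred, pred ? value : 0 };
    return out;
}

/* Mux node: selects the reaching definition whose predicate is true.
 * If no predicate is true, the output is not defined. */
static maybe_int mux(const int *data, const bool *preds, size_t n) {
    maybe_int out = { false, 0 };
    for (size_t k = 0; k < n; k++) {
        if (preds[k]) {
            assert(!out.present);   /* assumed: predicates are mutually exclusive */
            out.present = true;
            out.value = data[k];
        }
    }
    return out;
}
```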

Operations  with  side-effects  are  parameterized  with  a  predicate  input,  which  indicates  whether  the  oper¬ 
ation  should  take  place.  If  the  predicate  is  false,  the  operation  is  not  executed.  Predicate  values  are  indicated 
in  our  figures  with  dotted  lines. 

The compiler adds dependence edges between operations whose side-effects may not commute. Such
edges carry only an explicit synchronization token, not data. Operations with memory side-effects (loads,
stores, calls, and returns) all have a token input. When a side-effect operation depends on multiple other
operations (e.g., a write operation following a set of reads), it must collect one token from each of them.
For this purpose a combine operator is used; a combine has multiple token inputs and a single token output;
the output is generated after it receives all its inputs. In figures (e.g., see Figure 12) dashed lines indicate
token flow and the combine operator is depicted by a “V”. Token edges explicitly encode data flow through
memory.  In  fact,  the  token  network  can  be  interpreted  as  an  SSA  form  for  the  memory  values,  where  the 




Figure 12: Pegasus representation of the loop from Figure 10 (the loop bound is 10), before register
promotion. Solid lines represent value flow, dotted lines indicate predicate flow and dashed lines represent tokens.
The up-triangles are merge operators (i.e., SSA φ nodes), and the down-triangles are eta operators. The
dark eta sends the token value out of the loop on loop completion. The merge-eta nodes labeled “@a” are
used to carry the token mediating all accesses to array a.

combine operator is similar to a φ function. The tokens encode true-, output- and anti-dependences,
and they are “may” dependences. In Figure 12 there is one load and two stores. A load is denoted
by “=[ ]” and has three inputs: address, predicate and token; it produces two outputs: the loaded value and
another token. A store is denoted by “[ ]=” and has four inputs: address, data, predicate and token; the
only output is a token.

4.2.2  Register  Promotion  in  Pegasus 

We sketch the most important analysis and transformation steps carried out by CASH for register promotion.
Although the actual promotion in Pegasus is slightly more complicated than in a CFG-based representation
(because of the need to maintain φ-nodes), the dependence tests used to decide whether promotion can be
applied are much simpler: the graph will have a very restricted structure if promotion can be applied.⁷ The
key element of the representation is the token edge network whose structure can be quickly analyzed to
determine important properties of the memory operations.

We illustrate register promotion on the example in Figure 10.

1.  The token network for the Pegasus representation is shown in Figure 13. Memory accesses that
may interfere with each other will all belong to the same connected component of the token network.
Operations that belong to distinct components of the token network commute and can therefore be
analyzed separately. In this example there is a single connected component, corresponding to accesses
made to the array a.

⁷The network encodes relatively simple dependence information. However, as pointed out in [?], elementary
dependence tests are sufficient for most cases of register promotion.
Figure 13: Token network for the Pegasus representation of the loop in Figure 10.

2.  The addresses of the three memory operations in this component are analyzed: they are all determined
to be induction variables having the same step, 1. This implies that the dependence distances between
these  accesses  are  constant  (i.e.,  iteration-independent),  making  these  accesses  candidates  for  register 
promotion. 

The  induction  step  of  the  addresses  indicates  the  type  of  promotion:  a  0  step  indicates  loop-invariant 
accesses,  while  a  non-zero  step,  as  in  this  example,  indicates  strided  accesses. 

3.  The token network is further analyzed. Notice that prior to register promotion, memory
disambiguation has already proved (based on symbolic computation on address expressions) that the accesses to
a[i] and a[i+2] commute, and therefore there is no token edge between them. The token network
for a consists of two strands: one for the accesses to a[i], and one for a[i+2]; the strands are
generated at the mu, on top, and joined before the etas, at the bottom, using a combine (V). Promotion
can be carried out if and only if all memory accesses within the same strand are made to the same address.

CASH  generates  the  initialization  for  the  scalar  temporaries  and  the  “valid”  bits  in  the  loop  pre-header. 
We  do  not  illustrate  this  step. 

4.  Each  strand  is  scanned  from  top  to  bottom  (from  the  mu  to  the  eta),  term-rewriting  each  memory 
operation: 

•  Figure 14 shows how a load operation is transformed by register promotion. The resulting
construction  can  be  interpreted  as  follows:  “If  the  data  is  already  valid  do  not  do  the  load  (i.e., 
the  load  predicate  is  ‘and’-ed  with  the  negation  of  the  valid  bit)  and  use  the  data.  Otherwise 
do  the  load  if  its  predicate  indicates  it  needs  to  be  executed.”  The  multiplexor  will  select  either 
the  load  output  or  the  initial  data,  depending  on  the  predicates.  If  neither  predicate  is  true,  the 
output  of  the  mux  is  not  defined,  and  the  resulting  valid  bit  is  false. 

•  Figure 15 shows the term-rewriting process for a store. After this transformation, all stores
except the generating store are removed from the graph (for this purpose the token input is
connected directly to the token output, as described in [BG03]). The resulting construction is
Figure 14: Term-rewriting of loads for register promotion. “d” and “valid” are the register-promoted data
and its valid bit respectively.


Figure  15:  Term-rewriting  of  stores  for  register  promotion. 


interpreted  as  follows:  “If  the  store  occurs,  the  data- to-be- stored  replaces  the  register-promoted 
data,  and  it  becomes  valid.  Otherwise,  the  register-promoted  data  remains  unchanged.” 

5. Code is synthesized to shift the scalar values and predicates around between strands (the assignments
t_{j-1} = t_j), as illustrated in Figure 16.

6. The insertion of a loop postlude is somewhat more difficult in general than a loop prelude, since by
definition natural loops have a unique entry point, but may have multiple exits. In our implementation
each loop body is completely predicated and therefore all instructions get executed, albeit some are
nullified by the predicates. The compensating stores are added to the loop body and executed only
during the last iteration. This is achieved by using the loop-termination predicate as the predicate
controlling these stores. This step is not illustrated.

Figure 16: Shifting scalar values between strands. A similar network is used to shift the “valid” bits.

for (i) {
    if (c1) *p += 1;
    if (c2) *p += 2;
    if (f(i)) *p += i;
}

Figure 17: Sample code with loop-invariant memory accesses. c1 and c2 stand for loop-invariant expressions.
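
As a rough illustration of steps 4 and 5, the sketch below applies the load rewriting, a simplified store rewriting, and the shifting of scalars and valid bits to a small loop with a reuse distance of one iteration. Everything in it is an assumption made for the example rather than CASH output: the loop itself, the names `run_original`, `run_promoted`, `t0`, `t1`, `v0` and `v1`, the load counter, and the decision to let every store still write through to memory, so that no postlude is needed and only loads are saved.

```c
#include <assert.h>
#include <stdbool.h>

/* Reference loop: a[i+1] = a[i] + 1 whenever c[i] holds.
 * Returns the number of loads issued from a[]. */
static int run_original(int *a, const bool *c, int n)
{
    int loads = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (c[i]) {
            int x = a[i];          /* load a[i] */
            loads++;
            a[i + 1] = x + 1;      /* store a[i+1] */
        }
    }
    return loads;
}

/* Promoted loop with two strands:
 * (t0, v0) shadows a[i] and (t1, v1) shadows a[i+1]. */
static int run_promoted(int *a, const bool *c, int n)
{
    int loads = 0;
    int t0 = 0, t1 = 0;
    bool v0 = false, v1 = false;   /* pre-header: nothing is valid yet */
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* Load rewriting: the load predicate is and-ed with !valid. */
        if (c[i] && !v0) {
            t0 = a[i];
            loads++;
            v0 = true;
        }
        /* Store rewriting: the stored data becomes the promoted data.
         * Simplification: the store also writes through to memory. */
        if (c[i]) {
            t1 = t0 + 1;
            v1 = true;
            a[i + 1] = t1;
        }
        /* Shift values and valid bits between strands for iteration i+1. */
        t0 = t1;
        v0 = v1;
        v1 = false;
    }
    return loads;
}
```

With the condition pattern true, true, false, true, the second load is served from `t0`, while the load after the skipped iteration has to go to memory again because the shifted valid bit is false.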


5  Handling  Loop-Invariant  Predicates 

The register promotion algorithm described above can be improved by treating loop-invariant
predicates specially. If the disjunction of the predicates guarding all the loads and stores of the same
location contains a loop-invariant subexpression, then the initialization load can be lifted out of the loop
and guarded by that subexpression. Consider Figure 17, to which we apply loop-invariant scalar promotion.

By applying our register promotion algorithm one gets the result in Figure 18. However, using the fact
that c1 and c2 are loop-invariant the code can be optimized as in Figure 19. Both Figure 18 and Figure 19
execute the same number of loads and stores, and therefore, by our optimality criterion, are equally good.
However, the code in Figure 19 is obviously superior.

We can generalize this observation: the code can be improved whenever the disjunction of all conditions
guarding loads from or stores to *p is weaker than some loop-invariant expression (even if none of the
conditions is itself loop-invariant), such as in Figure 20. In this case the disjunction of all predicates is
f(i) || !f(i), which is constant “true.” Therefore, the load from *p can be unconditionally lifted out of
the loop as shown in Figure 21.

In general, let us assume that each statement s is controlled by a predicate $P(s)$. Then for each
promoted memory location a[i+j]:

1. Define the predicate $P_j = \bigvee_{s_j} P(s_j)$, where $s_j \in \{\text{statements accessing } a[i+j]\}$.

2. Write $P_j$ as the disjunction of two predicates, $P_j = P_j^{\mathrm{inv}} \vee P_j^{\mathrm{var}}$, where $P_j^{\mathrm{inv}}$ is loop-invariant and $P_j^{\mathrm{var}}$ is
loop-dependent.
/* prelude */
tmp_valid = false;

for (i) {
    /* first load from *p */
    if (!tmp_valid && c1) {
        tmp = *p;
        tmp_valid = true;
    }

    /* first store to *p */
    if (c1) {
        /* tmp_valid is known to be true */
        tmp += 1;
        tmp_valid = true;
    }

    /* second load from *p */
    if (!tmp_valid && c2) {
        tmp = *p;
        tmp_valid = true;
    }

    /* second store to *p */
    if (c2) {
        tmp += 2;
        tmp_valid = true;
    }

    fi = f(i);  /* evaluate f(i) only once */

    /* third load from *p */
    if (fi && !tmp_valid) {
        tmp = *p;
        tmp_valid = true;
    }

    /* third store to *p */
    if (fi) {
        tmp += i;
        tmp_valid = true;
    }
}

/* postlude */
if (tmp_valid)
    *p = tmp;

Figure 18: Optimization of the code in Figure 17 without using the invariance of some predicates.

/* prelude */
tmp_valid = c1 || c2;
if (tmp_valid)
    tmp = *p;

for (i) {
    /* first load from *p redundant */

    /* first store to *p */
    if (c1)
        /* tmp_valid is known to be true */
        tmp += 1;

    /* second load from *p redundant */

    /* second store to *p */
    if (c2)
        tmp += 2;

    fi = f(i);

    /* third load from *p */
    if (fi && !tmp_valid) {
        tmp = *p;
        tmp_valid = true;
    }

    /* third store to *p */
    if (fi) {
        tmp += i;
        tmp_valid = true;
    }
}

/* postlude */
if (tmp_valid)
    *p = tmp;

Figure 19: Optimization of the code in Figure 17 using the invariance of c1 and c2.

tmp = *p;
for (i) {
    fi = f(i);
    if (fi)
        tmp += 1;
    if (!fi)
        tmp = 2;
}
*p = tmp;

Figure 20: Code with no loop-invariant predicates, but with loop-invariant memory accesses.

3. In the prelude, initialize t_j_valid = $P_j^{\mathrm{inv}}$ (the loop-invariant part of $P_j$).

4. In the prelude, initialize t_j = t_j_valid ? a[i_0+j] : 0.⁸

5. The predicate guarding each statement $s_j$ is strengthened: $P(s_j) := P(s_j) \wedge \ldots$

Our  current  implementation  of  this  optimization  in  CASH  only  lifts  out  of  the  loop  the  disjunction  of 
all  predicates  which  are  actually  loop-invariant. 
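
To make the effect of lifting the loop-invariant part of the disjunction concrete, the following sketch gives a runnable version of the loop of Figure 17 next to a promoted version in the style of Figure 19. It is only an illustration under assumptions: the predicate `f`, the explicit iteration count `n`, the function names `fig17_original` and `fig19_promoted`, and the `loads` counter are hypothetical additions that do not appear in the figures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical loop-dependent predicate standing in for f(i). */
static bool f(int i) { return i % 3 == 0; }

/* The loop of Figure 17, with an explicit iteration count n. */
static void fig17_original(int *p, int n, bool c1, bool c2)
{
    for (int i = 0; i < n; i++) {
        if (c1) *p += 1;
        if (c2) *p += 2;
        if (f(i)) *p += i;
    }
}

/* Promotion with the loop-invariant part (c1 || c2) lifted into the
 * prelude. Returns the number of loads issued from *p. */
static int fig19_promoted(int *p, int n, bool c1, bool c2)
{
    int loads = 0;
    int tmp = 0;
    bool tmp_valid = c1 || c2;   /* valid bit starts as the invariant part */
    if (tmp_valid) {             /* guarded initialization load */
        tmp = *p;
        loads++;
    }
    for (int i = 0; i < n; i++) {
        if (c1) tmp += 1;        /* loads for c1 and c2 are redundant */
        if (c2) tmp += 2;
        bool fi = f(i);
        if (fi && !tmp_valid) {  /* only this load remains in the loop */
            tmp = *p;
            loads++;
            tmp_valid = true;
        }
        if (fi) {
            tmp += i;
            tmp_valid = true;
        }
    }
    if (tmp_valid)               /* postlude */
        *p = tmp;
    assert(loads <= 1);          /* *p is read at most once */
    return loads;
}
```

When c1 or c2 holds, the single load happens in the prelude and the in-loop load never fires; when both are false, the load is deferred until the first iteration in which f(i) holds.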

6  Discussion 

6.1  Dynamic  Disambiguation 

Our scalar promotion algorithm can be naturally extended to cope with a limited number of memory
accesses which cannot be disambiguated at compile time. By combining dynamic memory disambiguation
[Nic89] with our scheme to handle conditional control flow, we can apply scalar promotion even when
pointer analysis determines that memory references may interfere. Consider the example in Figure 22: even
though dependence analysis indicates that p cannot be promoted since the access to q may interfere, the
bottom part of the figure shows how register promotion can be applied.

This scheme is an improvement over the one proposed by Sastry [SJ98], which stores to memory all the
values held in scalars when entering an un-analyzable code region (which in this case is the region guarded
by f(i)).
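
A runnable variant of the dynamic-disambiguation example can help check the claim that the promoted loop stays correct whether or not p and q alias. The sketch below is an illustration only; the predicate `f`, the iteration count `n`, the function names and the `loads` counter are assumptions introduced for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: the store through q happens once, at i == 2. */
static bool f(int i) { return i == 2; }

/* Original loop: s accumulates *p, while *q is conditionally cleared. */
static int sum_original(const int *p, int *q, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        s += *p;
        if (f(i)) *q = 0;
    }
    return s;
}

/* Promoted loop: *p lives in tmp; a store through q invalidates tmp
 * only when the run-time address comparison says p and q alias. */
static int sum_promoted(const int *p, int *q, int n, int *loads)
{
    int s = 0, tmp = 0;
    bool tmp_valid = false;
    assert(loads != NULL);
    *loads = 0;
    for (int i = 0; i < n; i++) {
        if (!tmp_valid) {
            tmp = *p;
            (*loads)++;
            tmp_valid = true;
        }
        s += tmp;
        if (f(i)) {
            *q = 0;
            if (p == q)            /* dynamic disambiguation */
                tmp_valid = false;
        }
    }
    return s;
}
```

In the aliasing case the scalar is reloaded exactly once after the interfering store; in the non-aliasing case a single load serves the whole loop.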

6.2  Hardware  support 

While  our  algorithm  does  not  require  any  special  hardware  support,  certain  hardware  structures  can  improve 
its  efficiency. 

Rotating registers were introduced in the Cydra 5 architecture [DHB89] to support software pipelining.
These were used on Itanium for register promotion [DKK+99] to shift all the scalar values in one cycle.
Rotating predicate registers as in the Itanium can rotate the “valid” flags.

Software valid bits can be used to reduce the overhead of maintaining the valid bits. If a value
is reused k iterations later, then our algorithm requires the use of 2k different scalars: k valid bits and
k values. A software-only solution is to pack the k valid bits into a single integer⁹ and to use masking
and shifting to manipulate them. This makes rotation very fast, but testing and setting more expensive, a
trade-off that may be practical on a wide machine having “free” scheduling slots.

⁸ i_0 is the initial value of i in the loop.

tmp = *p;
tmp_valid = true;
for (i) {
    fi = f(i);

    /* first load */
    if (fi && !tmp_valid)
        tmp = *p;

    /* first store */
    if (fi) {
        tmp += 1;
        tmp_valid = true;
    }

    /* second store */
    if (!fi) {
        tmp = 2;
        tmp_valid = true;
    }
}
if (tmp_valid)
    *p = tmp;

tmp = *p;
for (i) {
    fi = f(i);
    if (fi)
        tmp += 1;
    if (!fi)
        tmp = 2;
}
*p = tmp;

Figure 21: Optimization of the code in Figure 20 using the fact that the disjunction of all predicates guarding
*p is loop-invariant (i.e., constant) “true” and the same code after further constant propagation.

for (i) {
    s += *p;
    if (f(i)) *q = 0;
}

tmp_valid = false;
for (i) {
    if (!tmp_valid) {
        tmp = *p;
        tmp_valid = true;
    }
    s += tmp;
    if (f(i)) {
        *q = 0;
        if (p == q)  /* dynamic disambiguation */
            tmp_valid = false;
    }
}

Figure 22: Code with ambiguous dependences and its non-speculative register promotion relying on
dynamic disambiguation.
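
The packed representation of the valid bits described under “Software valid bits” might look as follows. This is a minimal sketch under assumptions: the 32-bit mask type, the helper names `valid_test`, `valid_set` and `valid_shift`, and the shift direction (the bit of strand j moves to strand j-1 at the end of each iteration) are choices made for the example, not details taken from CASH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_STRANDS 32u   /* one bit per strand in a 32-bit mask */

/* Test the valid bit of strand j: a shift and a mask. */
static bool valid_test(uint32_t mask, unsigned j)
{
    assert(j < MAX_STRANDS);
    return ((mask >> j) & 1u) != 0;
}

/* Set the valid bit of strand j: a shift and an or. */
static uint32_t valid_set(uint32_t mask, unsigned j)
{
    assert(j < MAX_STRANDS);
    return mask | (UINT32_C(1) << j);
}

/* End-of-iteration rotation: every valid bit moves from strand j to
 * strand j-1 in a single shift, and the top strand becomes invalid. */
static uint32_t valid_shift(uint32_t mask)
{
    return mask >> 1;
}
```

The rotation of all k flags costs one shift, whereas each test or set now needs a shift plus a mask operation instead of a plain register access.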

Predicated data [RC03] has been proposed for an embedded VLIW processor: predicates are not attached
to instructions, but to data itself, as an extra bit of each register. Predicates are propagated through
arithmetic, similar to exception poison bits. The proposed architecture supports rotating registers by
implementing the register file as an actual large shift register. These architectural features would make the
valid flags essentially free both in space and in time.

6.3  Other  Applications  of  SIDE 

This paper introduces the SIDE framework for run-time dataflow evaluation, and presents the register
promotion algorithm as a particular instance. Register promotion uses the dynamic evaluation of availability
and uses predication to remove memory accesses for achieving optimality. SIDE is naturally applied to the
availability dataflow information, because it is a forward dataflow analysis, and its run-time determination
is trivial.

PRE [MR79] is another optimization which uses availability information and which could possibly
benefit from the application of SIDE. In particular, safe PRE forms (i.e., those which never introduce new
computations on any path) seem amenable to the use of SIDE. While some forms of PRE, such as lazy code
motion [KRS92], are optimal, they do incur small overheads; for example, safety and optimality together
require the restructuring of control-flow, for instance by splitting some critical CFG edges.¹⁰ A technique
such as SIDE could be used on a predicated architecture to trade off the creation of additional basic blocks
against conditionally computing the redundant expression.

⁹ Most likely, promotion across more iterations than bits in an integer requires too many registers to be profitable.
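
The trade-off mentioned for PRE can be pictured with a small hypothetical example, which is not taken from the paper: instead of splitting an edge to place the expression on the path where it is missing, the program tracks at run time whether the expression is available and computes it under a predicate. The function names, the availability flag `t_avail` and the multiplication counter are assumptions introduced for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

static int mul_count = 0;   /* instrumentation: multiplications executed */

static int counted_mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* a * b is partially redundant: it is recomputed after the branch
 * even when the branch has already computed it. */
static int pre_original(int a, int b, bool c)
{
    int x = 0;
    if (c)
        x = counted_mul(a, b);
    int y = counted_mul(a, b);
    return x + y;
}

/* SIDE-style variant: availability of a * b is evaluated dynamically
 * and the second computation is predicated on it. */
static int pre_with_side(int a, int b, bool c)
{
    int t = 0;
    bool t_avail = false;
    int x = 0;
    if (c) {
        t = counted_mul(a, b);
        t_avail = true;
        x = t;
    }
    if (!t_avail)            /* compute only on paths where it is missing */
        t = counted_mul(a, b);
    assert(t == a * b);      /* debug check only, disabled by NDEBUG */
    int y = t;
    return x + y;
}
```

No new basic block is needed on the path that skips the branch; the price is the extra flag and the predicated computation.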

The technique used by Bodík et al. in [BG97] can be seen as another application of the SIDE framework,
this time for the backwards dataflow problem of dead-code. This application is considerably more difficult,
a fact reflected in the complexity of their algorithm.

An  interesting  question  is  whether  this  technique  can  be  applied  to  other  dataflow  analyses,  and  whether 
its  application  can  produce  savings  by  eliminating  computations  more  expensive  than  the  inserted  code. 

7  Experimental  Evaluation 

7.1  Expected  Performance  Impact 

The  scalar  promotion  algorithm  presented  here  is  optimal  with  respect  to  the  number  of  loads  and  stores 
executed.  But  this  does  not  necessarily  correlate  with  improved  performance  for  four  reasons. 

First, it uses more registers, to hold the scalar values and flags, and thus may cause more spill code, or
interfere with software pipelining.

Second, it contains more computations than the original program in maintaining the flags. The optimized
program may end up being slower than the original, depending, among other things, on the frequency
with which the memory access statements are executed and whether the predicate computations are on the
critical path. For example, if none of them is executed dynamically, all the inserted code is overhead. In
practice profiling information and heuristics should be used to select the loops which will most benefit from
this transformation.

Third, scalar promotion removes memory accesses which hit in the cache,¹¹ therefore its benefit appears
to be limited. However, in modern architectures L1 cache hits are not always cheap. For example, on the
Intel Itanium 2 some L1 cache hits may cost as much as 17 cycles [CL03]. Register promotion trades off
bandwidth to the load-store queue (or the L1 cache) for bandwidth to the register file, which is always
bigger.

Fourth, by predicating memory accesses, operations which were originally independent, and could
potentially be issued in parallel, now become dependent through the predicates. This could increase the
dynamic critical path of the program, especially when memory bandwidth is not a bottleneck.

7.2  Performance  Measurements 

In this section we present measurements of our register promotion algorithm as implemented in the CASH
C compiler. We show static and dynamic data for C programs from three benchmark suites: Mediabench
[LPMS97], SpecInt95 [Sta95], and Spec CPU2000 [Sta00].

Our implementation does not use dirty bits and therefore is not optimal with respect to the number of
stores (it may, in fact, incur additional stores with respect to the original program). However, dirty bits can
only save a constant number of stores, independent of the number of iterations. We have considered their
overhead unjustified. We only lift loop-invariant predicates to guard the initializer; our implementation can
thus optimize Figure 17 but not Figure 20. As a simple heuristic to reduce register pressure, we do not
scalarize a value if it is not reused for 3 iterations.
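
The remark about dirty bits can be illustrated with a small sketch. It assumes, as a reading of the paragraph above rather than a description of CASH, that a dirty bit is set whenever the promoted scalar is written, and that the postlude store is guarded either by the valid bit (no dirty bits) or by the dirty bit. The loop, the name `run_promoted`, and the `stores` counter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Promoted form of the hypothetical loop
 *     for (i) { if (i % 2 == 0) s += *p;  if (s > 100) *p = s; }
 * The postlude store is guarded by the dirty bit when use_dirty_bit is
 * set, and by the valid bit otherwise. *stores counts stores to *p. */
static int run_promoted(int *p, int n, bool use_dirty_bit, int *stores)
{
    int tmp = 0, s = 0;
    bool valid = false, dirty = false;
    assert(stores != NULL);
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) {            /* conditional load of *p */
            if (!valid) {
                tmp = *p;
                valid = true;
            }
            s += tmp;
        }
        if (s > 100) {               /* conditional store to *p */
            tmp = s;
            valid = true;
            dirty = true;
        }
    }
    assert(!dirty || valid);         /* a dirty scalar is always valid */
    if (use_dirty_bit ? dirty : valid) {
        *p = tmp;                    /* postlude: compensating store */
        (*stores)++;
    }
    return s;
}
```

When the location is only read, the version without dirty bits writes the unchanged value back once after the loop; that is the constant number of extra stores a dirty bit would save.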

¹⁰ “Critical” here means connecting a basic block with multiple successors to a basic block with multiple predecessors.

¹¹ Barring conflict misses within the loop.

                      Variables                                       Variables
Program      Invariant     Strided      Program           Invariant     Strided
             old   new   old   new                        old   new   old   new
adpcm_e        0     0     0     0      099.go             40    53     2     2
adpcm_d        0     0     0     0      124.m88ksim        23    10     1     4
gsm_e          1     1     1     0      129.compress        0     0     1     0
gsm_d          1     1     1     0      130.li              1     0     1     1
epic_e         0     0     0     0      132.ijpeg           5     1     9     5
epic_d         0     0     0     0      134.perl            6     0     0     1
mpeg2_e        1     0     1     0      147.vortex         22    20     1     0
mpeg2_d        4     3     0     0      164.gzip           20     0     1     0
jpeg_e         3     0     7     5      175.vpr             7     2     0     0
jpeg_d         2     1     7     5      176.gcc            11    40     5     2
pegwit_e       6     0     3     1      181.mcf             0     0     0     0
pegwit_d       6     0     3     1      197.parser         20     3     3     5
g721_e         0     0     2     0      254.gap             1     0    18     1
g721_d         0     0     2     0      255.vortex         22    20     1     0
pgp_e         24     1     5     0      256.bzip2           2     2     8     0
pgp_d         24     1     5     0      300.twolf           1     2     0     0
rasta          3     0     2     1
mesa          44     4     2     0

Table 2: How often scalar promotion is applied. “New” indicates additional cases which are enabled by our
algorithm. We count the number of different “variables” to which promotion is applied. If we can promote
arrays a and b in the same loop, we count two variables.


Table 2 shows how often scalar promotion can be applied. Column 3 shows that our algorithm found
many more opportunities for scalar promotion that would not have been found using previous scalar
promotion algorithms (however, we do not include here the opportunities discovered by PRE). CASH uses a
simple flow-sensitive intra-procedural pointer analysis for dependence analysis.

Figure 23 and Figure 24 show the percentage decrease in the number of loads and stores respectively
that result from the application of our register promotion algorithms. The data labeled PRE indicate the
number of memory operations removed by our straight-line code optimizations only. The data labeled loop
show the additional benefit of applying inter-iteration register promotion. We have included both bars since
some of the accesses can be eliminated by both algorithms.

The most spectacular results occur for 124.m88ksim, which has substantial reductions in both loads
and stores. Only two functions are responsible for most of the reduction in memory traffic: alignd and
loadmem. Both these functions benefit from a fairly straightforward application of loop-invariant memory
access removal. Although loadmem contains control-flow, the promoted variable is always accessed
unconditionally. The substantial reduction in memory loads in gsm_e is also due to register promotion of
invariant memory accesses, in the hottest function, Calculation_of_the_LTP_parameters. This
function contains a very long loop body created using many C macros, which expand to access several
constant locations in a local array. The loop body contains control-flow, but all accesses to the small array
are unconditional. Finally, the substantial reduction of the number of stores for rasta is due to the FR4TR
function, which also benefits from unconditional register promotion.

Figure 23: Percentage reduction in the number of dynamic load operations due to the application of the
memory PRE and register promotion optimizations.

The impact of these reductions on actual execution time depends highly on hardware support. The
performance impact modeled on Spatial Computation (described in [BG03, Bud03]) is shown in Figure
Spatial  Computation  can  be  seen  as  an  approximation  for  a  very  wide  machine,  but  which  is  connected  by 
a  bandwidth-limited  network  to  a  traditional  memory  system. 

We model a relatively slow memory system, with a 4-cycle L1 cache hit time. Interestingly, the
improvement in running time is better if memory is faster (e.g., with a perfect memory system of 2-cycle
latency the gsm_e speed-up becomes 18%). This effect occurs because the cost of the removed L1 accesses
becomes a smaller fraction of total execution cost when memory latency increases.

The  speed-ups  range  from  a  1.1%  slowdown  for  183.equake,  to  a  maximum  speed-up  of  14%  for  gsm_e. 
There  is  a  fairly  good  correlation  of  speed-up  and  the  number  of  removed  loads.  The  number  of  removed 
stores  seems  to  have  very  little  impact  on  performance,  indicating  that  the  load-store  queue  contention 
caused  by  stores  is  not  a  problem  for  performance  (since  stores  complete  asynchronously,  they  do  not  have 
a direct impact on end-to-end performance). Five programs have a performance improvement of more than
5%.  Since  most  operations  removed  are  relatively  inexpensive,  because  they  have  good  temporal  locality, 
the  performance  improvement  is  not  very  impressive.  Register  promotion  alone  causes  a  slight  slow-down 
for  4  programs,  while  being  responsible  for  a  speed-up  of  more  than  1%  for  only  7  programs. 

Figure 24: Percentage reduction in the number of dynamic store operations due to the application of the
memory PRE and register promotion optimizations.

8 Related work

The canonical register promotion papers are by Steve Carr et al.: [CCK90, CK94]. Duesterwald et al. [DGS93]
describe a dataflow analysis for analyzing array references; the optimizations based on it are conservative:
only busy stores and available loads are removed; they notice that the redundant stores can be removed and
compensated by peeling the last k loop iterations, as shown earlier in this paper. Lu and Cooper [LC97] study the
impact of powerful pointer analysis in C programs for register promotion. Sastry and Ju [SJ98] introduce
the idea of selective promotion for analyzable regions. None of these algorithms simultaneously handles
both inter-iteration dependences and control-flow in the way suggested in this paper. [SJ98, LCK+98] show
how to use SSA to facilitate register promotion. [LCK+98] also shows how PRE can be “dualized” to
handle the removal of redundant store operations.

Schemes that use hardware support for register promotion such as [PGM00, DO94, OG01] are radically
different from our proposal, which is software-only. Hybrid solutions, utilizing several of these techniques
combined with SIDE, can be devised.

Bodík et al. [BGS99] analyze the effect of PRE on promoting loaded values and estimate the potential
improvements. The idea of predicating code for dynamic optimality was also advanced by Bodík and Gupta [BG97],
and was applied for partial dead-code elimination. In fact, the latter paper can be seen as an application of
the SIDE framework to the dataflow problem of dead-code. Muchnick [Muc97] gives an example in which
a load can be lifted out of a loop because it occurs on both branches of an if statement (which would
optimize our Figure 20 directly as shown in Figure 21), but he doesn’t describe a general algorithm for
solving the problem optimally.

9  Conclusions 

We  have  described  a  scalar  promotion  algorithm  which  eliminates  all  redundant  loads  and  stores  even  in 
the  presence  of  conditional  control  flow.  The  key  insight  in  our  algorithm  is  that  availability  information, 
traditionally  computed  only  at  compile-time,  can  be  more  precisely  evaluated  at  run-time.  We  transform 
memory  accesses  into  scalar  values  and  perform  the  loads  only  when  the  scalars  do  not  already  contain 
the  correct  value,  and  the  stores  only  when  their  value  will  not  be  overwritten.  Our  approach  substantially 
increases  the  number  of  instances  when  register  promotion  can  be  applied. 

As the computational bandwidth of processors increases, such optimizations may become more
advantageous. In the case of register promotion, the benefit of removing memory operations sometimes outweighs
the increase in scalar computations to maintain the dataflow information; since the removed operations
tend to be inexpensive (i.e., they hit in the load-store queue or in the L1 cache), the resulting performance
improvements are relatively modest.




References

[AKPW83] R. Allen, K. Kennedy, C. Porterfield, and J.D. Warren. Conversion of control dependence
to data dependence. In ACM Symposium on Principles of Programming Languages (POPL),
pages 177-189, Austin, Texas, January 1983.

[BG97] Rastislav Bodík and Rajiv Gupta. Partial dead code elimination using slicing transformations.
In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), pages 159-170, Las Vegas, Nevada, June 1997.

[BG02a] Mihai Budiu and Seth Copen Goldstein. Compiling application-specific hardware. In
International Conference on Field Programmable Logic and Applications (FPL), pages 853-863,
Montpellier (La Grande-Motte), France, September 2002.

[BG02b] Mihai Budiu and Seth Copen Goldstein. Pegasus: An efficient intermediate representation.
Technical Report CMU-CS-02-107, Carnegie Mellon University, May 2002.

[BG03] Mihai Budiu and Seth Copen Goldstein. Optimizing memory accesses for spatial computation.
In International ACM/IEEE Symposium on Code Generation and Optimization (CGO), pages
216-227, San Francisco, CA, March 23-26 2003.

[BGS99] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. Load-reuse analysis: Design and evaluation.
In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), Atlanta, GA, May 1999.

[Bud03] Mihai Budiu. Spatial Computation. PhD thesis, Carnegie Mellon University, Computer Science
Department, December 2003. Technical report CMU-CS-03-217.

[CCK90] S. Carr, D. Callahan, and K. Kennedy. Improving register allocation for subscripted variables.
In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), White Plains NY, June 1990.

[CK94] S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow.
Software: Practice and Experience, 24(1), January 1994.

[CL03] Jean-Francois Collard and Daniel Lavery. Optimizations to prevent cache penalties for the
Intel Itanium 2 processor. In International ACM/IEEE Symposium on Code Generation and
Optimization (CGO), San Francisco, CA, March 23-26 2003.

[CMS96] S. Carr, Q. Mangus, and P. Sweany. An experimental evaluation of the sufficiency of scalar
replacement algorithms. Technical Report TR96-04, Michigan Technological University,
Department of Computer Science, 1996.

[CSC+00] Lori Carter, Beth Simon, Brad Calder, Larry Carter, and Jeanne Ferrante. Path analysis and
renaming for predicated instruction scheduling. International Journal of Parallel Programming,
special issue, 28(6), 2000.

[CW95] Steve Carr and Qunyan Wu. The performance of scalar replacement on the HP 715/50.
Technical Report TR95-02, Michigan Technological University, Department of Computer Science,
1995.




[DGS93] Evelyn Duesterwald, Rajiv Gupta, and Mary Lou Soffa. A practical data flow framework
for array reference analysis and its use in optimizations. In ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), pages 68-77. ACM Press, 1993.

[DHB89] J. C. Dehnert, P. Y. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. In
International Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS), pages 26-38, April 1989.

[DKK+99] Carole Dulong, Rakesh Krishnaiyer, Dattatraya Kulkarni, Daniel Lavery, Wei Li, John Ng, and
David Sehr. An overview of the Intel IA-64 compiler. Intel Technology Journal, 1999.

[DO94] Peter J. Dahl and Matthew T. O’Keefe. Reducing memory traffic with CRegs. In IEEE/ACM
International Symposium on Microarchitecture (MICRO), pages 100-111, November 1994.

[KRS92] Jens Knoop, Oliver Rüthing, and Bernhard Steffen. Lazy code motion. In ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI), pages 224-234.
ACM Press, 1992.

[LC97] John Lu and Keith D. Cooper. Register promotion in C programs. In ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI), pages 308-319. ACM
Press, 1997.

[LCK+98] Raymond Lo, Fred Chow, Robert Kennedy, Shin-Ming Liu, and Peng Tu. Register promotion
by sparse partial redundancy elimination of loads and stores. In ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), pages 26-37. ACM Press, 1998.

[LPMS97] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: a tool for
evaluating and synthesizing multimedia and communications systems. In IEEE/ACM
International Symposium on Microarchitecture (MICRO), pages 330-335, 1997.

[MLC+92] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann.
Effective compiler support for predicated execution using the hyperblock. In International
Symposium on Computer Architecture (ISCA), pages 45-54, Dec 1992.

[MR79] E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies.
Communications of the ACM, 22(2):96-103, 1979.

[Muc97] S.S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann
Publishers, Inc, 1997.

[Nic89] A. Nicolau. Run-time disambiguation: Coping with statically unpredictable dependencies.
IEEE Transactions on Computers (TOC), 38(5):664-678, 1989.

[OBM90] Karl J. Ottenstein, Robert A. Ballance, and Arthur B. Maccabe. The program dependence web:
a representation supporting control-, data-, and demand-driven interpretation of imperative
languages. In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), pages 257-271, 1990.

[OG01] S. Onder and R. Gupta. Load and store reuse using register file contents. In ACM International
Conference on Supercomputing, pages 289-302, Sorrento, Naples, Italy, June 2001.




[PGM00] Matthew Postiff, David Greene, and Trevor Mudge. The store-load address table and
speculative register promotion. In IEEE/ACM International Symposium on Microarchitecture
(MICRO), pages 235-244. ACM Press, 2000.

[RC03] Davide Rizzo and Osvaldo Colavin. A scalable wide-issue clustered VLIW with a
reconfigurable interconnect. In International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems (CASES), San Jose, CA, 2003.

[SJ98] A. V. S. Sastry and Roy D. C. Ju. A new algorithm for scalar register promotion based on SSA
form. In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI), pages 15-25. ACM Press, 1998.

[Sta95] Standard Performance Evaluation Corp. SPEC CPU95 Benchmark Suite, 1995.

[Sta00] Standard Performance Evaluation Corp. SPEC CPU 2000 Benchmark Suite, 2000.
http://www.specbench.org/osg/cpu2000.




